Welcome to STAT 440/CSE 440! These are the course notes, which can be found in the stat440-fa21-course-content repository or in GitHub Enterprise. GitHub Enterprise may be called GHE for short. The intent of this course is to discuss and demonstrate typical data wrangling and data management tasks for students who are comfortable using R or Python. These notes are written for the student who is unfamiliar with such tasks or who has not seen data wrangling tasks on real datasets. If you are a practicing data engineer, then these notes may not be useful. If you came to these notes looking for information about Python and Jupyter Lab, please view the file with the .ipynb extension in the stat440-fa21-course-content repository or here.
I live in Urbana.
I like Urbana, and because I’m a transplant, I like to learn more about these twin cities in my spare time. One way I like to learn about things is by looking at data. We live in a technologically advanced world, and there just so happen to be data portals about our locality.
The City of Urbana Open Data portal has one particularly interesting dataset, the Rental Inspection Grades Listing Data, that I will use as the main dataset to apply the data management concepts throughout this course and these notes.
With this dataset, you will learn how to
But before we go any further, you should review the course syllabus (found in the syllabus directory of the stat440-fa21-course-content repository or here) and familiarize yourself with the software section. Below I say a bit more about the software used in this course and how it’s going to help us. I assume you have downloaded the necessary software at this point.
This course has come a long way since its inception and the early, well-defined notes of Maria Muyot. The course no longer offers SAS and instead caters to R and Python. A recent addition to the course is the use of Git and GHE. GHE will serve as the course learning management system (LMS), and Git gives students a way to appreciate the advantages of version control and collaboration.
Markdown is a markup language with a simple syntax for crafting flexible text documents. Markdown documents can be converted to HTML, PDF, and MS Word documents, and LaTeX syntax can be included within Markdown as well. Markdown was created by John Gruber in 2004 and has grown in popularity since. Markdown syntax is the main way to style text in RMarkdown, Jupyter Lab, and Jupyter Notebooks.
Here are some frequent Markdown syntax examples:
- bold text (text enclosed with ** on both sides)
- italic text (text enclosed with * on both sides)
- lists (this is a list already) (mark each item with -)
- tables (pipes between columns and a new line with ---|---)
| Variable | Description |
|---|---|
| Var1 | student ID |
| Var2 | height of students |
https://www.markdownguide.org/
R is a statistical programming language that is widely popular among statisticians in research and academia. It is free and open-source, and its users continue to develop it to handle tasks beyond statistics and data science. We can write R code, usually in a script saved as a .R file. The output of the code we run is visible in the console.
Click here to download R
This is what R looks like
RStudio is an interface (formally known as an integrated development environment) for R that is relatively user-friendly, visually sufficient, and well-organized for content management. In RStudio, we can do almost everything we need for this course, such as run R calculations, install and update R packages, write RMarkdown (.Rmd) files and render or “knit” them to .html, build websites and Shiny apps, etc.
Click here to download RStudio
This is what RStudio looks like
RMarkdown is a powerful package that allows us to create reproducible documents such that anyone on the planet can look at our code and get the same results. The other nice thing about reproducible documents is that we can have descriptions and directions written in narrative style along with code and the results of that code. RMarkdown comes pre-installed with RStudio, and it can also be installed as a package in base R.
RMarkdown documents are saved with the .Rmd file extension. This document that you are reading is an RMarkdown document, which, as the name suggests, is based on Markdown syntax.
Thanks to RMarkdown, I can explicitly embed an R code chunk like this:
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
In the .Rmd source, the chunk above opens with three backticks followed by a header written as {r cars}. The curly braces and the character “r” are required if you want the code to execute. We always want the code to execute! Also, notice that the word “cars” is written after the “r”. In this case, “cars” is a label for the code chunk. It is recommended to label every code chunk, but each label must be unique. Using non-unique labels prevents the Rmd file from knitting.
For the assignments in this course, students will submit two files to their individual student repository: 1) either a .Rmd or .ipynb file and 2) a .html file. The .Rmd (or .ipynb) is the reproducible document file that contains your solutions to given assignments. These reproducible documents can be rendered to a nicer web-browser-friendly format: html. Students need to make sure they always submit both files by the deadline to receive a proper grade. Rendering to html can happen:
Git is a version control system and GitHub Enterprise (or GHE) is one platform for uniting git commands with a collaborative workflow. Git and GHE smooth collaboration by allowing multiple users to work on the same document on their own devices all at the same time. Then those users can submit their updates, describe their updates, and no one needs to rename the file. That file is updated in Git with the same file name it began with. Even if you are working alone on a code file, using version control can help alleviate confusion about what you were doing last time on that file.
Git is the software that allows us to connect to repositories (or repos for short) and collaborate on projects that may exist in GHE. Git is the version control software itself, so we usually work with it through an interface. That interface can be your local machine’s terminal or command prompt: ‘Terminal’ (macOS/Linux) or ‘Git Bash’ (Windows). Another interface can be a git client such as RStudio or GitHub Desktop. I’ll say more about these and how to set up your connection to your repo in the sections below.
Here are some Git terms (in alphabetical order) we need to become familiar with.
branches: pathways of the repo. We can work on branches without affecting the master and this may be useful for experimenting with something without affecting the main project.
cloning: copying an existing remote repo to your local machine so that you can work on it locally
commit: a snapshot of the last saved version of a file and the corresponding message explaining what changed in this version of the file
diff: the set of differences between commits; observing diffs helps keep track of what has changed across two commits
directory: a folder or sub-folder within a repository
fetching: downloading changes from a remote that you do not yet have locally, without merging them into the files in your working directory
history: the tracking of all changes to a file
master: the main track of your repo. The master is also considered a branch. GHE may eventually replace the name “master” with a new name due to social justice advocates who oppose the oppressive history of the term “master”
merge conflict: when two separate branches have made edits to the same line in a file, or when a file has been deleted in one branch but edited in the other. Merge conflicts can be fixed by manually editing the problem file and then merging and re-committing/pushing.
merging: combining the changes from one branch into another, typically into the master. Merging is also the step at which conflicts are resolved when collaborators commit changes to the same file.
pulling: a single command that does both fetching and merging
pushing: sending the commits from your local repo to GHE, which finalizes and formalizes your updated files there
remote: the copy of the repo that is hosted on a server such as GHE and that your local clone is connected to
repo (or repository): the main folder or space in which a project exists
staging (or the staging area): a file that stores information about what’s being committed. We want to stage a file after we’ve updated it in some way so that it can be committed
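To see how several of these terms fit together, here is a minimal sketch that can be run in a terminal. It works entirely in a throwaway folder and never touches GHE or a course repo. The file name notes.txt, the branch name experiment, and the placeholder identity are made up for illustration.

```shell
# Sandbox only: everything happens in a temporary folder, not in a course repo.
sandbox="$(mktemp -d)"
cd "$sandbox"

git init -q                                    # repo: create a new, empty repository
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is called master, as in these notes
git config user.name "Example Student"         # placeholder identity, stored for this sandbox repo only
git config user.email "student@example.com"

echo "Problem 1: draft" > notes.txt
git add notes.txt                              # staging: choose what goes into the next commit
git commit -q -m "Add first draft of notes"    # commit: a snapshot plus a message

git checkout -q -b experiment                  # branch: work without affecting master
echo "Problem 1: revised" > notes.txt
git commit -q -a -m "Revise notes on a branch"

git diff master experiment                     # diff: what differs between the two branches
git log --oneline                              # history: one line per commit

git checkout -q master
git merge -q experiment                        # merging: bring the branch's changes into master
cat notes.txt                                  # master now contains the revised line
```

Because master had no new commits of its own, this merge cannot produce a conflict. A merge conflict would require both branches to have edited the same line.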
In this course we are using GitHub Enterprise, which we call “GHE” for simplicity, as an LMS and as a tool for self-collaboration. LMS means that GHE will function as the main course space: students will retrieve assignments, submit their assignments in their individual student repos, see their grades, get course updates and announcements, communicate with each other and with the course staff, access necessary course content, etc. Here self-collaboration means that you will work on a file, such as a homework assignment, and make several commits for that file as you complete the problems. Each time you make a commit, it is best to push the file to your individual student repo to submit it. Thus, you are collaborating only with your own earlier versions of the file.
The course exists in GHE at the website https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content. The course website landing page is the README.md file, which will serve as the Course Announcements. You should check the course website frequently for updates and course announcements. Also, as the syllabus mentions, one of the main ways to communicate is the Issues board, which functions as a discussion board much like Piazza. To access the Issues board, go to the course landing page and click on the Issues tab.
There are three ways to interact with this course, which means there are three ways to interact with git: using GHE in a web browser, using the command line, and using a git client such as RStudio.
All three will give you access to course content and allow you to retrieve and submit assignments. But only the first way (Using GHE) will allow you to use the Issues board for posting questions and seeing responses. The second and third ways still require some aspect of the first way because of GHE. Below I discuss the procedures for setting up and connecting your individual student repo to your local machine as well as how to submit a homework00 file in each of the three ways.
In order to complete these procedures successfully, you must first do these two steps:
Step i) Log into GHE at https://github-dev.cs.illinois.edu/login with your netID and Illinois password. If you’ve never used GHE through the University before, logging in will establish your account.
Step ii) Create your individual student repo (named as your netID) by clicking on this link https://edu.cs.illinois.edu/create-ghe-repo/stat440-fa21/. Afterwards, you should see something like this.
Actually, completing Steps i) and ii) is all that is needed to set up and connect your individual student repo. This isn’t really a local machine kind of setup since you are choosing to interact with the course by using GHE. The course staff will give you access to the main repo and course landing page stat440-fa21-course-content by adding you to the “students” team. If you do not have access to the stat440-fa21-course-content repo 24 hours after completing Steps i) and ii), please contact me at kinson2@illinois.edu.
Inside the stat440-fa21-course-content repo, there should be a minimum of the syllabus directory and the README.md file. See image below. I say “minimum” because these notes were written before newer files and directories were added to the stat440-fa21-course-content repo.
You will simply refresh the page of the stat440-fa21-course-content repo to see any updates and course announcements. Refreshing the page equates to pulling the repo since you are using GHE and not working with git locally.
To retrieve assignments, go to the particular assignment directory, homework for example, and click on the assignment you need to complete. I have made a fake homework assignment called stat440-fa21-homework00.md and its rendered html file called stat440-fa21-homework00.html. It should be in the stat440-fa21-course-content/homework portion of GHE. See image below.
To begin the assignment, I advise students to click on the “Raw” button for a particular assignment (for right now, we are using the stat440-fa21-homework00.md).
Then, copy all of the text on that page and paste that into a blank .Rmd (or .ipynb) file depending on your preferred software.
Now, save the file as homework00-netID.Rmd (or .ipynb) inside of your preferred software.
Now, complete the first problem by writing your solution beneath the Problem #1 wording. Then render your reproducible document file to .html. Rendering to html is also called “knitting”. See image below.
To submit your assignments, go to your individual student repo (named as your netID) in GHE and upload the files (both .Rmd and .html). You can upload by clicking the “upload an existing file”, then “choose your files” or simply dragging and dropping the two files into your repo page.
You can do this multiple times for your assignment submissions which is why I say you have unlimited submissions. Just be sure that your reproducible document file (either .Rmd or .ipynb) and rendered file (.html) are up to date with each other. It is not a good idea to complete the assignment in the .Rmd file, but forget to render it to .html.
Be sure to complete Steps i) and ii) above. If you do not have access to the stat440-fa21-course-content repo 24 hours after completing Steps i) and ii), please contact me at kinson2@illinois.edu.
The majority of these steps are discussed in a different way in the reference text Happy Git and GitHub for the useR by Bryan et al. https://happygitwithr.com/.
Now, we are going to clone the stat440-fa21-course-content repo. Cloning this repo will be the first step to accessing the most up to date course content, updates, and announcements. To clone the repo, go to the stat440-fa21-course-content repo on GHE or here. Next, click on the green “Code” button. Then, click the clipboard in order to copy the repo’s URL. See image below.
Now, open the terminal or command prompt on your local machine: ‘Terminal’ (macOS/Linux) or ‘Git Bash’ (Windows). The code below changes the current directory to the Desktop folder locally and clones the repo into a new folder called stat440-fa21. At the cursor in your terminal, type the following two commands, each on its own line:
cd ~/Desktop
git clone https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content.git stat440-fa21
Press Enter after each command to run it.
Next, we can verify whether that clone was successful by listing all files in this new folder called stat440-fa21 on our local machine. The code below changes the current directory to the stat440-fa21 folder locally and then lists all files with the ls command. Type the following two lines in your terminal, pressing Enter after each line:
cd stat440-fa21
ls
The resulting listing of files should contain a minimum of the syllabus directory and the README.md file. See image below. I say “minimum” because these notes were written before newer files and directories were added to the stat440-fa21-course-content repo.
Great! You have successfully connected the stat440-fa21-course-content repo to your local machine. Cloning the repo should happen only once per local machine, meaning you should almost never have to re-establish the connection to GHE for this particular repo. One reason you may need to re-clone the repo is if you have deleted the stat440-fa21 folder from your local machine.
To keep your local machine up to date with the latest course content (including course announcements and assignments), you will pull from the remote repo into the local folder that you called stat440-fa21. To pull via the command line, type the following in your terminal, pressing Enter after each line. If your terminal is already inside the stat440-fa21 folder, skip the first line; pwd prints the current directory so you can confirm where you are:
cd stat440-fa21
pwd
git pull
After pulling successfully, you should see a message such as this.
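If you would like to watch a clone and a pull happen without touching the real course repo, the sketch below rehearses the same cycle locally. It assumes a local bare repository (course-content.git) as a stand-in for GHE; the folder names instructor and student and the file contents are hypothetical.

```shell
# Sandbox only: a local bare repository stands in for the course repo on GHE.
work="$(mktemp -d)"
cd "$work"
git init -q --bare course-content.git
git --git-dir=course-content.git symbolic-ref HEAD refs/heads/master

# The "instructor" clone publishes a first file.
git clone -q course-content.git instructor 2>/dev/null   # hides the empty-repository warning
cd instructor
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Instructor"                # placeholder identity for the sandbox
git config user.email "instructor@example.com"
echo "Welcome" > README.md
git add README.md
git commit -q -m "Add README"
git push -q origin master
cd ..

# The "student" clones once.
git clone -q course-content.git student

# Later, the instructor posts an announcement.
cd instructor
echo "Homework 00 posted" >> README.md
git commit -q -a -m "Announce homework 00"
git push -q origin master
cd ..

# The student pulls to catch up; no second clone is needed.
cd student
git remote -v        # shows where this clone fetches from and pushes to
git pull -q          # fetch plus merge in one command
cat README.md        # now contains both lines
```

In the real course, the only commands you need from this sketch are the ones in the student part: one clone at the start of the semester, and then a pull whenever you want the latest content.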
To retrieve assignments, go to the particular assignment directory locally on your machine, homework for example, and click on the assignment you need to complete. I have made a fake homework assignment called stat440-fa21-homework00.md and its rendered html file called stat440-fa21-homework00.html. It should be in the homework sub-folder of the stat440-fa21 folder on your local machine.
To begin the assignment, I advise students to open the original .md file, e.g. stat440-fa21-homework00.md, using RStudio or Jupyter Lab depending on your preferred software.
Now, save the file as homework00-netID.Rmd (or .ipynb). It is good practice to save this file somewhere outside of the stat440-fa21 folder, e.g. your Desktop. Doing so ensures that you know which file you are working on, and it reduces confusion about whether the file is the original assignment or not.
Now, complete the second problem by writing your solution beneath the Problem #2 wording. Then, render your reproducible document file to .html. Rendering to .html is also called “knitting”. See image below.
Now that your solution to Problem #2 is saved locally, you want to practice submitting the assignment from your local machine to your individual student repo in GHE (not to the stat440-fa21-course-content repo). This is a form of self-collaboration. Recall that submitting an assignment in Git translates to committing and pushing the changes.
To submit your assignments, we must first connect your individual student repo to your local machine, which means we need to clone it.
Now, we are going to clone your individual student repo, which is named as your netID. If your repo already contains files, go to your netID repo on GHE, click on the green “Code” button, and then click the clipboard in order to copy the repo’s URL. See image below.
Now, open the terminal or command prompt on your local machine: ‘Terminal’ (macOS/Linux) or ‘Git Bash’ (Windows). At the cursor in your terminal, type the following, pressing Enter after each line and replacing netID with your own net ID:
cd ~/Desktop
git clone https://github-dev.cs.illinois.edu/stat440-fa21/netID.git netID
As an alternative to cd ~/Desktop, using cd .. moves our current directory up from stat440-fa21 to whatever folder is above it. Your directories on your local machine may be set up differently than mine. Moving the current directory up from stat440-fa21 is a simple way to ensure we aren’t cloning our individual student repos into the stat440-fa21 folder locally.
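A quick way to convince yourself of what cd .. does is to print the current directory before and after. The sketch below uses a temporary folder with a made-up Desktop/stat440-fa21 layout, so it does not depend on how your own machine is organized.

```shell
# Sandbox only: build a hypothetical Desktop/stat440-fa21 layout in a temporary folder.
base="$(mktemp -d)"
mkdir -p "$base/Desktop/stat440-fa21"

cd "$base/Desktop/stat440-fa21"
pwd          # path ends in Desktop/stat440-fa21

cd ..
pwd          # path now ends in Desktop, one level up
```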
Next, we can verify whether that clone was successful by listing all files in this new folder called netID on our local machine. The name netID should be your net ID. The code below changes the current directory to your netID folder locally and then lists all files with the ls command. Type the following two lines in your terminal, pressing Enter after each line:
cd netID
ls
There shouldn’t be any files in your repo, because we assume this is your first time accessing your individual student repo. For the same reason, when you ran git clone you may have received the message warning: You appear to have cloned an empty repository.
Now, you need to copy the homework00-netID.Rmd and homework00-netID.html files from their current location into your netID folder. Go to the terminal and type the following, pressing Enter after each line and choosing the version that matches your OS:
Windows (Command Prompt, if not using Git Bash)
cd %USERPROFILE%\Desktop
copy homework00-netID.Rmd netID
copy homework00-netID.html netID
or
Mac/Linux (and Windows Git Bash)
cd ~/Desktop
cp homework00-netID.Rmd netID
cp homework00-netID.html netID
Copying these files in this way only works if the files are in the current directory. Otherwise, you will need to add information about the file location, such as cp Desktop/homework00-netID.Rmd Desktop/netID (or copy Desktop\homework00-netID.Rmd Desktop\netID in the Windows Command Prompt). In this example, the homework file is on the Desktop and the netID folder is in the Desktop.
Now that those assignment files are in our local netID folder, we need to actually submit them to our individual student repo in GHE. Go to the terminal and type the following, pressing Enter after each line:
cd netID
git add homework00-netID.Rmd homework00-netID.html
git commit -m "Added two homework files from local machine"
git push origin master
Great! We have successfully submitted our first (fake) homework assignment, resulting in the image below. You can verify whether any commit and push has been successful by going to GHE and checking that the file is in the location you intended and that the commit message is also present and correct.
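Besides checking GHE in the browser, a push can also be checked from the terminal. The sketch below rehearses the add, commit, and push steps in a sandbox and then inspects the result. It assumes a local bare repository (netID.git) as a stand-in for your individual student repo, and the file contents are placeholders.

```shell
# Sandbox only: a local bare repository stands in for your netID repo on GHE.
work="$(mktemp -d)"
cd "$work"
git init -q --bare netID.git
git clone -q netID.git netID 2>/dev/null       # hides the empty-repository warning
cd netID
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Student"         # placeholder identity for the sandbox
git config user.email "student@example.com"

# Placeholder files standing in for the real homework files.
echo "solutions" > homework00-netID.Rmd
echo "<html></html>" > homework00-netID.html

git status --short                             # before staging: both files are listed as untracked (??)
git add homework00-netID.Rmd homework00-netID.html
git commit -q -m "Added two homework files from local machine"
git push -q origin master

git status --short                             # after committing: no output, nothing is left to commit
git log --oneline origin/master                # the commit message as the remote now has it
git ls-tree --name-only origin/master          # the files the remote now has
```

In your real netID folder, only the last three commands are needed for the check, and they do not change anything in your repo.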
You can submit an assignment as many times as you like, which is why I say you have unlimited submissions. Just be sure that your reproducible document file (either .Rmd or .ipynb) and rendered file (.html) are up to date with each other. It is not a good idea to complete the assignment in the .Rmd file, but forget to render it to .html.
Be sure to complete Steps i) and ii) above. If you do not have access to the stat440-fa21-course-content repo 24 hours after completing Steps i) and ii), please contact me at kinson2@illinois.edu.
The majority of these steps are discussed in a different way in the reference text Happy Git and GitHub for the useR by Bryan et al. https://happygitwithr.com/.
The notes below are for RStudio which can be a git client. There are other git clients such as GitHub Desktop. If you prefer to use a git client that is not RStudio, please follow the directions in those clients.
We assume you have not previously cloned this or any repo mentioned below. We are going to point to the Desktop in these steps. Cloning will not work if you already have a folder on your local machine’s Desktop directory called “stat440-fa21.” If you do have a folder called “stat440-fa21” in your Desktop, then delete it or use a different name such as “stat440.” Deleting local folders does not affect the repo in GHE, which is one reason version control and GHE are so powerful!
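The same restriction can be seen from the command line. In the sketch below, which again uses a local bare repository (course-content.git) as a stand-in for GHE, git refuses to clone into a folder that already exists and contains files, while a different folder name works. The file old.txt is made up for illustration.

```shell
# Sandbox only: a local bare repository stands in for the course repo on GHE.
work="$(mktemp -d)"
cd "$work"
git init -q --bare course-content.git

mkdir stat440-fa21
echo "old notes" > stat440-fa21/old.txt        # the target folder already exists and is not empty

if git clone -q course-content.git stat440-fa21 2>/dev/null; then
  echo "clone succeeded"
else
  echo "clone refused: stat440-fa21 already exists and is not empty"
fi

git clone -q course-content.git stat440 2>/dev/null   # a different folder name works
ls -d stat440 stat440-fa21
```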
Now, we are going to clone the stat440-fa21-course-content repo with RStudio. Cloning this repo will be the first step to accessing the most up to date course content, updates, and announcements. To clone the repo, go to the stat440-fa21-course-content repo on GHE or here. Next, click on the green “Code” button. Then, click the clipboard in order to copy the repo’s URL. See image below.
Now, open RStudio, click on “File”, then “New Project…”, then “Version Control”, then “Git”. See images below.
Now, paste the repo’s URL (copied from above) in the “Repository URL” field. Type “stat440-fa21” in the “Project Directory Name” field because that will be the name of the folder that is connected to the stat440-fa21-course-content repo. In the “Create project as sub-directory of” field, select the Desktop; in the future, it could be anywhere of your choosing.
Next, we can verify whether that clone was successful by checking our computer’s Desktop for a folder named “stat440-fa21.” Remember that we assume this folder didn’t exist before. If there was already a folder with that same name, then the clone would not be successful. One quick resolution is to delete the “stat440-fa21” folder from the Desktop and redo this cloning procedure. The resulting set of files should contain a minimum of the syllabus directory, the README.md file, and a new file called stat440-fa21.Rproj. See image below. I say “minimum” because these notes were written before newer files and directories were added to the stat440-fa21-course-content repo.
Great!
Another way to verify that you have successfully cloned with RStudio is by noticing RStudio has opened your stat440-fa21 directory via its stat440-fa21.Rproj file. This .Rproj file is a file that RStudio creates to keep up with projects that have been created. This file is something you keep locally; do not commit and push this to GHE. Also, notice that inside of RStudio, you now have a new Git tab in the top-right pane. This Git tab becomes our central interface for interacting with Git.
Even greater! You have successfully connected the stat440-fa21-course-content repo to your local machine with RStudio. Cloning the repo should happen only once per local machine, meaning you should almost never have to re-establish the connection to GHE for this particular repo. One reason you may need to re-clone the repo is if you have deleted the stat440-fa21 folder from your local machine.
To keep your local machine up to date with the latest course content (including course announcements and assignments), you will pull from the remote repo into the local folder that you called stat440-fa21. To pull the stat440-fa21-course-content repo via RStudio, click on the Git tab, then click on the blue down arrow, which means to pull down the repo.
After pulling successfully, you should see a message such as this.
“Already up to date” is not necessarily the message you will see. You just want to see that the pull was successful.
To retrieve assignments, go to the particular assignment directory locally on your machine, homework for example, and click on the assignment you need to complete. I have made a fake homework assignment called stat440-fa21-homework00.md and its rendered html file called stat440-fa21-homework00.html. It should be in the homework sub-folder of the stat440-fa21 folder on your local machine.
To begin the assignment, I advise students to open the original .md file, e.g. stat440-fa21-homework00.md, using RStudio.
Now, save the file as homework00-netID.Rmd inside of your preferred software. It is good practice to save this file somewhere outside of the stat440-fa21 folder, e.g. your Desktop. Doing so ensures that you know which file you are working on, and it reduces confusion about whether the file is the original assignment or not. Again, we assume that this file does not already exist on your Desktop.
Now, complete the third problem by writing your solution beneath the Problem #3 wording. Then, render your reproducible document file to .html. Rendering to .html is also called “knitting”. See image below showing the knitted version.
Now that your solution to Problem #3 is saved locally, you want to practice submitting the assignment from your local machine to your individual student repo in GHE (not to the stat440-fa21-course-content repo). This is a form of self-collaboration. Recall that submitting an assignment in Git translates to committing and pushing the changes.
To submit your assignments, we must first connect your individual student repo to your local machine, which means we need to clone it.
Now, we are going to clone your individual student repo, which is named as your netID. To clone the repo, go to your netID repo on GHE and copy the repo’s URL; if your repo already contains files, click on the green “Code” button and then click the clipboard. See image below, which assumes you do not have any files in your repo.
If you have files in your repo, then you may see this image below.
Now, open RStudio, click on “File”, then “New Project…”, then “Version Control”, then “Git”.
Now, paste your repo’s URL (copied from above) in the “Repository URL” field. Type “netID” (your net ID) in the “Project Directory Name” field because that will be the name of the folder that is connected to your netID repo. In the “Create project as sub-directory of” field, select the Desktop; in the future, it could be anywhere of your choosing.
Next, we can verify whether that clone was successful by checking our computer’s Desktop for a folder named “netID” (your net ID), checking that RStudio has opened a new Rproj file named “netID” (your net ID), and that there is an accessible Git tab. Remember that we assume this netID folder didn’t exist before. If there was already a folder with that same name, then the clone would not have worked, and you would see this image below.
There should be nothing in the folder since we also assume that you did not attempt the first and second ways to interact with the course.
Great! You’ve successfully cloned the repo.
Now, you need to copy the homework00-netID.Rmd and homework00-netID.html files to your netID folder. This can be done using RStudio’s “Files” tab in the bottom-right pane. First, check the box next to homework00-netID.Rmd. Then, click on the blue cog/wheel “More”, then click “Copy To”, then select the netID folder (which should be on your Desktop).
Now, repeat this for the homework00-netID.html file. Alternatively, since these two files are on your local machine, you can go directly to them outside of RStudio and copy/paste them to your netID folder.
Now that those assignment files are in our local netID folder, we need to actually submit them to our individual student repo in GHE. To submit them means to commit and push them. To do that, go to the Git tab (assuming your netID.Rproj project is currently open) and click on the “Commit” button, which opens a pop-up interface. In this pop-up, 1) check the Staged box for the two files homework00-netID.Rmd and homework00-netID.html (it may take a moment for the check to appear), 2) write the commit message “Added two homework files using RStudio” and click “Commit” (a new pop-up will perform the commit action), and 3) click on the green up arrow “Push” (a new pop-up will perform the push action).
Great! We have successfully submitted our first (fake) homework assignment resulting in the image below. You can verify whether any commit and push has been successful by going to GHE and checking if the file is there in the location you intended and that the commit message is also present and correct.
You can submit an assignment as many times as you like, which is why I say you have unlimited submissions. Just be sure that your reproducible document file (either .Rmd or .ipynb) and rendered file (.html) are up to date with each other. It is not a good idea to complete the assignment in the .Rmd file, but forget to render it to .html.
Suppose you close RStudio and you need to do more work on an assignment. It is quite simple to return to a repo/Rproj by clicking on “File”, then “Open Project…”, then select the folder and .Rproj file that you need to return to. For example, select the netID.Rproj (your net ID) file, which is located within the netID (your net ID) folder within the Desktop.